
Conversation

@viirya (Member) commented Dec 2, 2025

What changes were proposed in this pull request?

SupportsPushDownVariants was added as a Scan mix-in in #52578. This patch changes it to a ScanBuilder mix-in to follow the established patterns in the codebase.

Why are the changes needed?

SupportsPushDownVariants was added as a Scan mix-in in #52578. Changing it to a ScanBuilder mix-in follows the established pushdown patterns in the codebase, e.g., join pushdown and aggregate pushdown.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.0.14

@aokolnychyi (Contributor)

I personally would want the following changes to VARIANT pushdown in DSv2:

  1. Move the logic to ScanBuilder instead of Scan (this PR attempts to do exactly that).
  2. Evolve the connector API.
  • Rename interfaces to SupportsPushDownVariantExtractions / VariantExtraction (alternatives are welcome).
  • Pass each variant_get expression as a separate VariantExtraction so that connectors can check each field easily.
  • Return boolean[] from pushVariantExtractions to indicate what was pushed.
interface SupportsPushDownVariantExtractions extends ScanBuilder {
  boolean[] pushVariantExtractions(VariantExtraction[] extractions);
}

interface VariantExtraction {
  String[] columnName; // variant column name
  String path; // extraction path from variant_get and try_variant_get
  DataType expectedDataType; // expected data type
}
  3. Clearly state that connectors must only push down an extraction if the data has already been shredded. Connectors should not attempt to cast / extract on demand; it must be done in Spark if the data hasn't been shredded prior to the scan.

Basically, consider the following example:

  SELECT
    variant_get(v, '$.data[1].a', 'string'),
    variant_get(v, '$.key', 'int'),
    variant_get(s.v2, '$.x', 'double')
  FROM tbl;

This should pass the following extractions to the connector:

  VariantExtraction[] extractions = [
    new VariantExtraction(["v"], "$.data[1].a", StringType),
    new VariantExtraction(["v"], "$.key", IntegerType),
    new VariantExtraction(["s", "v2"], "$.x", DoubleType)  // ← nested in struct 's'
  ];

Connectors mark a VariantExtraction as pushed ONLY if they can guarantee that ALL records satisfy the expected type, meaning the data has been shredded prior to the scan.
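
A minimal connector-side sketch of how the proposed mix-in could be used, assuming the SupportsPushDownVariantExtractions and VariantExtraction shapes sketched above (they are part of the proposal, not existing Spark API); the ExampleScanBuilder class and isAlreadyShredded helper below are hypothetical:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.ScanBuilder;

// Imports for the proposed SupportsPushDownVariantExtractions / VariantExtraction
// are omitted because their final package is what this PR defines.
class ExampleScanBuilder implements ScanBuilder, SupportsPushDownVariantExtractions {
  private final List<VariantExtraction> pushedExtractions = new ArrayList<>();

  @Override
  public boolean[] pushVariantExtractions(VariantExtraction[] extractions) {
    boolean[] accepted = new boolean[extractions.length];
    for (int i = 0; i < extractions.length; i++) {
      // Accept an extraction only if table/file metadata proves this path was already
      // shredded with the expected type; otherwise Spark keeps evaluating it.
      if (isAlreadyShredded(extractions[i])) {
        pushedExtractions.add(extractions[i]);
        accepted[i] = true;
      }
    }
    return accepted;
  }

  @Override
  public Scan build() {
    // A real connector would build a Scan that reads the pushed extractions from
    // shredded columns; elided in this sketch.
    throw new UnsupportedOperationException("Scan construction elided in this sketch");
  }

  // Hypothetical metadata check; conservatively reports nothing as shredded here.
  private boolean isAlreadyShredded(VariantExtraction extraction) {
    return false;
  }
}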

@viirya (Member Author) commented Dec 4, 2025

Pass each variant_get expression as a separate VariantExtraction so that connectors can check each field easily.
interface VariantExtraction {
  String[] columnName; // variant column name
  String path; // extraction path from variant_get and try_variant_get
  DataType expectedDataType; // expected data type
}

Currently VariantAccessInfo represents an access to a variant column. So it has a member String columnName. What does a VariantExtraction represent?

Although you said "each variant_get expression as a separate VariantExtraction", if there are multiple variant_gets on the same variant column, do you mean to have multiple VariantExtractions? Currently they are all represented by one VariantAccessInfo for the variant column, which I think makes more sense.

@viirya (Member Author) commented Dec 4, 2025

Connectors mark VariantExtraction as pushed ONLY if they can guarantee that ALL records satisfy the expected type, meaning the data has been shredded prior to the scan.

I think connectors can still read and parse the variant to the required type even if it is not a shredded variant. From the point of view of Spark and the DSv2 API, we don't need to know how the connectors fulfill the pushdown requirement.

@aokolnychyi (Contributor) commented Dec 4, 2025

Currently VariantAccessInfo represents an access to a variant column. So it has a member String columnName. What does a VariantExtraction represent?

Although you said "each variant_get expression as a separate VariantExtraction", if there are multiple variant_gets on the same variant column, do you mean to have multiple VariantExtractions? Currently they are all represented by one VariantAccessInfo for the variant column, which I think makes more sense.

I expect each variant_get and try_variant_get to be converted to a VariantExtraction with the variant column name parts and the extraction JSON path. If a connector has shredded 2 out of 3 requested columns, it can simply mark with booleans what it supports and what must be done in Spark. If we use VariantAccessInfo, they would have to create a new StructType to indicate what they can extract? That seems very complicated and error-prone compared to returning booleans.
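
To make the 2-out-of-3 case concrete, a hedged illustration against the proposed API (the scanBuilder and extractions variables below are placeholders, not existing code):

// Extractions from the earlier example, in order:
//   ["v"]        "$.data[1].a"  string
//   ["v"]        "$.key"        int
//   ["s", "v2"]  "$.x"          double
boolean[] accepted = scanBuilder.pushVariantExtractions(extractions);
// e.g. accepted == [false, true, true]: only the last two paths were shredded, so the
// connector reads them from shredded columns while Spark still evaluates
// variant_get(v, '$.data[1].a', 'string') on top of the scan output.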

I think connectors can still read and parse the variant to the required type even if it is not a shredded variant. From the point of view of Spark and the DSv2 API, we don't need to know how the connectors fulfill the pushdown requirement.

I feel this is VERY dangerous. I read through the casting logic in Spark. It has so many edge cases. There is no way connectors will replicate this behavior. We don't want to have inconsistent extraction between connectors. In the future, we may add a casting function to VariantExtraction that Spark would provide. That said, I would not do it now.

@viirya (Member Author) commented Dec 4, 2025

Okay, I'm updating this PR according to these ideas. I will update it soon.

@aokolnychyi (Contributor)

Does the latest proposal make sense to you too, @cloud-fan @gengliangwang @dongjoon-hyun? Any thoughts?

@dongjoon-hyun (Member)

It would be great if we could see the tangible code to review, @aokolnychyi.

Could you update the PR if it's ready, @viirya?

@viirya (Member Author) commented Dec 5, 2025

Could you update the PR if it's ready, @viirya?

I will update this once it is ready.

@aokolnychyi (Contributor)

One more question I forgot. Do we need to make VariantExtraction extend Serializable? Is it supposed to be sent to executors? Seems like this will be done on the driver.

@dongjoon-hyun (Member)

Given the current status, I made a PR to branch-4.1 to mark this interface as Experimental for Apache Spark 4.1.0 only in order to be safe.

dongjoon-hyun added a commit that referenced this pull request on Dec 6, 2025: …tal`

### What changes were proposed in this pull request?

This PR aims to mark `SupportsPushDownVariants` as `Experimental` instead of `Evolving` in Apache Spark 4.1.x.

### Why are the changes needed?

During Apache Spark 4.1.0 RC2, it turns out that this new `Variant` improvement feature still needs more time to stabilize.

- #52522
- #52578
- #53276
- [[VOTE] Release Spark 4.1.0 (RC2)](https://lists.apache.org/thread/og4dn0g7r92qj22fdsmqoqs518k324q5)

We had better mark this interface itself as `Experimental` in Apache Spark 4.1.0 while keeping it `Evolving` in the `master` branch.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53354 from dongjoon-hyun/SPARK-54616.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya (Member Author) commented Dec 8, 2025

One more question I forgot. Do we need to make VariantExtraction extend Serializable? Is it supposed to be sent to executors? Seems like this will be done on the driver.

Hmm, for the built-in ParquetScan it is not required, because it doesn't send VariantExtraction to executors. It uses the extraction info to transform the schema, and the physical reader uses the schema to do the variant rewriting.

But it probably needs to be Serializable for third-party data source implementations, as we don't know how they will use the extraction info. They may send the info to executors and do the rewriting there.
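
For illustration, a hedged sketch of such a third-party use: DSv2's PartitionReaderFactory extends Serializable and is shipped to executors, so any VariantExtraction values captured in it would need to be serializable too. The ExampleReaderFactory below is hypothetical, and the import for VariantExtraction is omitted since its package is defined in this PR.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

// Hypothetical third-party factory: PartitionReaderFactory instances are serialized to
// executors, so the captured extraction info travels with them.
class ExampleReaderFactory implements PartitionReaderFactory {
  private final VariantExtraction[] pushedExtractions; // must be serializable to ship

  ExampleReaderFactory(VariantExtraction[] pushedExtractions) {
    this.pushedExtractions = pushedExtractions;
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    // A real connector would use pushedExtractions here to rewrite variant reads
    // executor-side; elided in this sketch.
    throw new UnsupportedOperationException("reader construction elided in this sketch");
  }
}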

@dongjoon-hyun (Member)

Thank you for updating the PR, @viirya.

* @since 4.1.0
*/
@Experimental
public final class VariantExtractionImpl implements VariantExtraction, Serializable {
Contributor:

Does it need to be public API?

@viirya (Member Author), Dec 9, 2025:

This can be internal, but that means every data source implementation needs to implement its own VariantExtraction. I think we can provide one implementation?

case nestedStruct: StructType =>
  field.name +: getColumnName(nestedStruct, rest)
case _ =>
  throw new IllegalArgumentException(
Contributor:

we should throw SparkException.internal

@viirya (Member Author):

Ok

@viirya (Member Author):

You mean SparkException.internalError, right?

* Pushes down variant field extractions to the data source.
* <p>
* Each element in the input array represents one field extraction operation from a variant
* column. Data sources should examine each extraction and determine whether it can be
@cloud-fan (Contributor), Dec 9, 2025:

Does Spark do any reconciliation? e.g. variant_get(v, '$.a', 'int') and variant_get(v, '$.b', 'int') will be two extractions, but how about variant_get(v, '$.a', 'struct<x int, y int>') and variant_get(v, '$.a.x', 'int')? Are they two extractions as well?

@viirya (Member Author):

Yes, they are two extractions. Each separate variant_get on a variant column will be its own extraction.

@dongjoon-hyun (Member) left a comment:

Please update the PR title with a valid JIRA ID, @viirya.

@viirya changed the title from "[SPARK-XXXXX][SQL] Refactor SupportsPushDownVariants to be a ScanBuilder mix-in" to "[SPARK-54656][SQL] Refactor SupportsPushDownVariants to be a ScanBuilder mix-in" on Dec 9, 2025
@viirya (Member Author) commented Dec 9, 2025

Please update the PR title with a valid JIRA ID, @viirya.

Thank you, @dongjoon-hyun. I updated it.

@dongjoon-hyun (Member)

Thank you so much, @viirya.

* @since 4.1.0
*/
@Experimental
public final class VariantExtractionImpl implements VariantExtraction {
Contributor:

I don't see the value of separating the interface and impl if we need to expose the impl anyway. Can we follow TableInfo and make VariantExtraction a class directly?

Contributor:

But let me confirm again: the API means Spark passes VariantExtraction instances to the data source, and data sources need to return booleans. Why do data sources need to use the impl class?

@viirya (Member Author):

Oh, you're right. We don't need to expose VariantExtractionImpl but only the interface.

* limitations under the License.
*/

package org.apache.spark.sql.execution.datasources;
Contributor:

@viirya (Member Author):

Okay. I moved it there and made it a Scala file instead of a Java file.

require(expectedDataType != null, "expectedDataType cannot be null")
require(columnName.nonEmpty, "columnName cannot be empty")

override def equals(obj: Any): Boolean = obj match {
Contributor:

I think Scala case classes already implement these basic methods.

@viirya (Member Author):

Removed. Thanks.

@dongjoon-hyun (Member)

Do you want to address Wenchen's last comment, @viirya?

@dongjoon-hyun self-assigned this on Dec 10, 2025
@viirya (Member Author) commented Dec 10, 2025

Do you want to address Wenchen's last comment, @viirya?

Yea, addressed it. Thanks.

@dongjoon-hyun (Member) left a comment:

Thank you, @viirya, @cloud-fan, @gengliangwang.

Pending CIs.

@dongjoon-hyun (Member)

Could you remove the unused import too, @viirya?

Error: ] /home/runner/work/spark-1/spark-1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/VariantExtractionImpl.scala:20: Unused import
Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=unused-imports, site=org.apache.spark.sql.internal.connector

@dongjoon-hyun (Member)

I removed it, @viirya.

@viirya (Member Author) commented Dec 10, 2025

I removed it, @viirya.

Okay. Thanks.
